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(54) Formant-based speech synthesizer employing demi-syllable concatenation with 
independent cross fade in the filter parameter and source domains 



(57) The concatenative speech synthesizer 
employs demi-syllable subword units to generate 
speech. The synthesizer is based on a source-filter 
model that uses source signals that correspond closely 
to the human glottal source and that uses filter parame- 
ters that correspond closely to the human vocal tract. 
Concatenation of the demi-syllable units is facilitated by 
two separate cross fade techniques, one applied in the 

A 



time domain to the demi-syllable source signal wave- 
forms, and one applied in the frequency domain by 
interpolating the corresponding filter parameters of the 
concatenated demi-syllables. The dual cross fade tech- 
nique results in natural sounding synthesis that avoids 
time-domain glitches without degrading or smearing 
charat eristic resonances in the filter domain. 
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Description 

Background and Summ ary of the Invention 



[0001] The present invention relates generally to 
speech synthesis and more particularly to a concatena- 
te synthesizer based on a source-filter model in which 
the source signal and filter parameters are generated by 
independent cross fade mechanisms. 
[0002] Modern day speech synthesis involves many 
tradeoffs. For limited vocabulary applications, it is usu- 
ally feasible to store entire words as digital samples to 
be concatenated into sentences for playback. Given a 
good prosody algorithm to place the stress on the 
appropriate words, these systems tend to sound quite 
natural, because the individual words can be accurate 
reproductions of actual human speech. However, for 
larger vocabularies it is not feasible to store complete 
word samples of actual human speech. Therefore, a 
number of speech synthesists have been experimenting 
with breaking speech into smaller units and concatenat- 
ing those units into words, phrases and ultimately sen- 
tences. 

[0003] Unfortunately, when concatenating sub- 
word units, speech synthesists must confront several 
very difficult problems. To reduce system memory 
requirements to something manageable, "it is necessary 
to develop versatile sub-word units that can be used to 
form many different words. However, such versatile sub- 
word units often do not concatenate well. During play- 
back of concatenated sub-word units, there is often a 
very noticeable distortion or glitch where the sub-word 
units are joined. Also, since the sub-word units must be 
modified in pitch and duration, to realize the intended 
prosodic pattern, most often a distortion is incurred from 
current techniques for making these modifications. 
Finally, since most speech segments are influenced 
strongly by neighboring segments, there is not a simple 
set of concatenation units (such as phonemes or 
diphones) which can adequately represent human 
speech. 

[0004] A number of speech synthesists have sug- 
gested various solutions to the above concatenation 
problems, but so far no one has successfully solved the 
problem. Human speech generates complex time-vary- 
ing waveforms that defy simple signal processing solu- 
tions. Our work has convinced us that a successful 
solution to the concatenation problems will arise only in 
conjunction with the discovery of a robust speech syn- 
thesis model. In addition, we will need an adequate set 
of concatenation units, and the further capability of 
modifying these units dynamically to reflect adjacent 
segments. 

[0005] The formant-based speech synthesizer of 
the invention is based upon a source-filter model that 
closely ties the source and filter synthesizer compo- 
nents to physical structures within the human vocal 
tract. Specifically, the source model is based on a best 



estimate of th source signal produced at the glottis, 
and the filter model is based on the resonant (formant- 
producing) structures generally above the glottis. For 
this reason, we call our synth sis technique "formant- 
5 based" synthesis. We believe that modeling the source 
and filter components as closely as possible to actual 
speech production mechanisms produces far more nat- 
ural sounding synthesis that other existing techniques. 
[0006] Our synthesis technique involves identifying 
70 and extracting the formants from an actual speech sig- 
nal (labeled to identify approximate demi-syllable areas) 
and then using this information to construct demi-sylla- 
ble segments each represented by a set of filter param- 
eters and a source signal waveform. The invention 
is provides a novel cross fade technique to smoothly con- 
catenate consecutive demi-syllable segments. Unlike 
conventional blending techniques, our system allows us 
to perform cross fade in the filter parameter domain 
while simultaneously but independently performing 
20 "cross fade" (parameter interpolation) of the source 
waveforms in the time domain. The filter parameters 
model vocal tract effects, while the source waveforms 
model the glottal source. The technique has the advan- 
tage of restricting prosodic modification to only the glot- 
25 tal source, if desired. This can reduce distortion usually 
associated with the conventional blending techniques. 
[0007] The invention further provides a system 
whereby interaction between initial and final demi-sylla- 
bles can be taken into account. Demi-syllables repre- 
30 sent the presently preferred concatenation unit. Ideally, 
concatenation units are selected at points of least co- 
articulatory effect. The syllable is a natural unit for this 
purpose, but choosing the syllable requires a large 
amount of memory. For systems with limited available 
35 memory, the demi-syllable is preferred. In the preferred 
embodiment we take into account how the initial and 
final demi-syllaWes within a given syllable interact with 
each other. We further take into account how demi-syl- 
lables across word boundaries and sentence bounda- 
40 ries interact with each other. This interaction information 
is stored in a waveform database containing not only the 
source waveform data and filter parameter data, but 
also the necessary label or marker data and context 
data used by the system in applying formant mod'rfica- 
45 tion rules. The system operates upon an input phoneme 
string by first performing unit selection, then building an 
acoustic string of syllable objects and then rendering 
those objects by performing the cross fade operations in 
both source signal and filter parameter domains. The 
so resulting output are source waveforms and filter param- 
eters that may then be used in a source-filter model to 
generate synthesized speech. 
[0008] The result is a natural sounding speech syn- 
thesizer that can be incorporated into many different 
55 consumer products. Although the techniques can be 
applied to any speech coding application, the invention 
is well suited for use as a concatenate speech synthe- 
sizer, suitable for use in text-to-speech applications. 
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This system is designed to work within the current mem- 
ory and processor constraints found in many consumer 
applications. In other words, the synthesizer is designed 
to fit into a small memory footprint, while providing bet- 
ter sounding synthesis than other synthesizers of larger 
size. 

[0009] For a more complete understanding of the 
invention, its objects and advantages, refer to the follow- 
ing specification and to the accompanying drawings. 

Brief Descrip tion of the Drawings 

[0010] 

Figure 1 is a block diagram, illustrating the basic 
source-filter model with which the invention may be 
employed; 

Figure 2 is a diagram of speech synthesizer tech- 
nology, illustrating the spectrum of possible source- 
filter combinations, particularly pointing out the 
domain in which the synthesizer of the present 
invention resides; 

Figure 3 is a flowchart diagram illustrating the pro- 
cedure for constructing waveform databases used 
in the present invention; 

Figure 4A and 4B comprise a flowchart diagram 
illustrating the synthesis process according to the 
invention. 

Figure 5 is a waveform diagram illustrating time 
domain cross fade of source waveform snippets; 
Figure 6 is a block diagram of the presently pre- 
ferred apparatus useful in practicing the invention; 
Figure 7 is a flowchart diagram illustrating the proc- 
ess in accordance with the invention 

Detaiied Description of the Preferred Embodiment 

[001 1 ] While there have been many speech synthe- 
sis models proposed in the past, most have in common 
the following two component signal processing struc- 
ture. Shown in Figure 1 , speech can be modeled as an 
initial source component 10, processed through a sub- 
sequent filter component 12. 

[001 2] Depending on the model, either source or fil- 
ter, or both can be very simple or very complex. For 
example, one earlier form of speech synthesis concate- 
nated highly complex PCM (Pulse Code Modulated) 
waveforms as the source, and a very simple (unity gain) 
filter. In the PCM synthesizer all apriori knowledge was 
imbedded in the source and none in the filter. By com- 
parison, another synthesis method used a simple 
repeating pulse train as the source and a comparatively 
complex filter based on LPC (Linear Predictive Coding). 
Note that neither of these conventional synthesis tech- 
niques attempted to model the physical structures 
within the human vocal tract that are responsible for pro- 
ducing human speech. 

[0013] The present invention employs a formant- 



based synthesis model that closely ties the source and 
filter synthesizer components to the physical structures 
within the human vocal tract. Specifically, the synthe- 
sizer of the present invention bases the source model 

5 on a best estimate of the source signal produced at the 
glottis. Similarly, the filter model is based on the reso- 
nant (formant producing) structures located generally 
above the glottis. For these reasons, we call our synthe- 
sis technique "formant-based". 

io [0014] Figure 2 summarizes various source-filter 
combinations, showing on the vertical axis a compara- 
tive measure of the complexity of the corresponding 
source or filter component. In Figure 2 the source and 
filter components are illustrated as side-by-side vertical 

is axes. Along the source axis relative complexity 
decreases from top to bottom, whereas along the filter 
axis relative complexity increases from top to bottom. 
Several generally horizontal or diagonal lines connect a 
point on the source axis with a point on the filter axis to 

20 represent a particular type of speech synthesizer. For 
example, the horizontal line 14 connects a fairly com- 
plex source with a fairly simple filter to define the TD- 
PSOLA synthesizer, an example of one type of well- 
known synthesizer technology in which a PCM source 

25 waveform is applied to an identity filter. Similarly, hori- 
zontal line 16 connects a relatively simple source with a 
relatively complex filter to define another known synthe- 
sizer of the phase vocorder, harmonic synthesizer. This 
synthesizer in essence uses a simple form of pulse train 

30 source waveform and a complex filter designed using 
spectral analysis techniques such as Fast Fourier 
Transforms (FFT). The classic LPC synthesizer is repre- 
sented by diagonal line 17, which connects a pulse train 
source with an LPC filter. The Klatt synthesizer 18 is 

35 defined by a parametric source applied through a filter 
comprised of formants and zeros. 
[0015] In contrast with the foregoing conventional 
synthesizer technology, the present invention occupies 
a location within Figure 2 illustrated generally by the 

40 shaded region 20. In other words, the present invention 
can use a source waveform ranging from a pure glottal 
source to a glottal source with nasal effects present. 
The filter can be a simple formant filter bank or a some- 
what more complex filter having formants and zeros. 

45 [001 6] To our knowledge the prior art concatenate 
synthesis has largely avoided region 20 in Figure 2. 
Region 20 corresponds as close as practical to the nat- 
ural separation in humans between the glottal voice 
source and the vocal tract (filter). We believe that oper- 

50 ating in region 20 has some inherent benefits due to its 
central position between the two extremes of pure time 
domain representation (such as TD-PSOLA) and the 
pure frequency domain representation (such as the 
phase vocorder or harmonic synthesizer). 

55 [0017] The presently preferred implementation of 
our formant-based synthesizer uses a technique 
employing a filter and an inverse filter to extract source 
signal and formant parameters from human speech. 
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The extracted signals and parameters are then used in 
the source-filter model corresponding to region 20 in 
Figure 2. The presently preferred procedure for extract- 
ing source and filter parameters from human speech is 
described later in this specification. The present 
description will focus on other aspects of the formant- 
based synthesizer, namely those relating to selection of 
concatenate units and cross fade. 
[001 8] The formant-based synthesizer of the inven- 
tion defines concatenation units representing small 
pieces of digitized speech that are then concatenated 
together for playback through a synthesizer sound mod- 
ule. The cross fade techniques of the invention can be 
employed with concatenation units of various sizes. The 
syllable is a natural unit for this purpose, but where 
memory is limited choosing the syllable as the basic 
concatenation unit may be prohibitive in terms of mem- 
ory requirements. Accordingly, the present implementa- 
tion uses the demi-syllable as the basic concatenation 
unit. An important part of the formant-based synthesizer 
involves performing a cross fade to smoothly join adja- 
cent demi-syllables so that the resulting syllables sound 
natural and without glitches or distortion. As will be 
more fully explained below, the present system per- 
forms this cross fade in both the time domain and the 
frequency domain, involving both components of the 
source-filter model: the source waveforms and the form- 
ant filter parameters. 

[0019] The preferred embodiment stores source 
waveform data and filter parameter data in a waveform 
database. The database in its maximal form stores dig- 
itized speech waveforms and filter parameter data for at 
least one example of each demi-syllable found in the 
natural language (e.g. English). In a memory-conserv- 
ing form, the database can be pruned to eliminate 
redundant speech waveforms. Because adjacent demi- 
syllables can significantly affect one another, the pre- 
ferred system stores data for each different context 
encountered. 

[0020] Figure 3 shows the presently preferred tech- 
nique for constructing the waveform database. In Figure 
3 (and also in subsequent Figures 4A and 4B) the 
boxes with double-lined top edges are intended to 
depict major processing block headings. The single- 
lined boxes beneath these headings represent the indi- 
vidual steps or modules that comprise the major block 
designated by the heading block. 
[0021] Referring to Figure 3, data for the waveform 
database is constructed as at 40 by first compiling a list 
of demi-syllables and boundary sequences as depicted 
at step 42. This is accomplished by generating all possi- 
ble combinations of demi-syllables (step 44) and by 
then excluding any unused combinations as at 46. Step 
44 may be a recursive process whereby alt different per- 
mutations of initial and final demi-syllables are gener- 
ated. This exhaustive list of all possible combinations is 
then pruned to reduce the size of the database. Pruning 
is accomplished in step 46 by consulting a word diction- 



ary 48 that contains phon tic transcriptions of all words 
that the synthesizer will pronounce. These phonetic 
transcriptions are used to weed out any demi-syllable 
combinations that do not occur in the words the synthe- 

5 sizer will pronounce. 

[0022] The preferred embodiment also treats 
boundaries between syllables, such as those that occur 
across word boundaries or sentence boundaries. These 
boundary units (often consonant clusters) are con- 

10 structed from diphones sampled from the correct con- 
text. One way to exclude unused boundary unit 
combinations is to provide a text corpus 50 containing 
exemplary sentences formed using the words found in 
word dictionary 48. These sentences are used to define 

15 different word boundary contexts such that boundary 
unit combinations not found in the text corpus may be 
excluded at step 46. 

[0023] After the list of demi-syllables and boundary 
units has been assembled and pruned, the sampled 

20 waveform data associated with each demi-syllable is 
recorded and labeled at step 52. This entails applying 
phonetic markers at the beginning and ending of the rel- 
evant portion of each demi-syllable, as indicated at step 
54. Essentially, the relevant parts of the sampled wave- 

25 form data are extracted and labeled by associating the 
extracted portions with the corresponding demi-syllable 
or boundary unit from which the sample was derived. 
[0024] The next step involves extracting source and 
filter data from the labeled waveform data as depicted 

so generally at step 56. Step 56 involves a technique 
described more fully below in which actual human 
speech is processed through a filter and its inverse filter 
using a cost function that helps extract an inherent 
source signal and filter parameters from each of the 

35 labeled waveform data. The extracted source and filter 
data are then stored at step 58 in the waveform data- 
base 60. The maximal waveform database 60 thus con- 
tains source (waveform) data and filter parameter data 
for each of the labeled demi-syllables and boundary 

40 units. Once the waveform database has been con- 
structed, the synthesizer may now be used. 
[0025] To use the synthesizer an input string is sup- 
plied as at 62 in Figure 4A. The input string may be a 
phoneme string representing a phrase or sentence, as 

45 indicated diagrammatically at 64. The phoneme string 
may include aligned intonation patterns 66 and syllable 
duration information 68. The intonation patterns and 
duration information supply prosody information that the 
synthesizer may use to selectively after the pitch and 

so duration of syllables to give a more natural human-like 
inflection to the phrase or sentence. 
[0026] The phoneme string is processed through a 
series of steps whereby information is extracted from 
the waveform database 60 and rendered by the cross 

55 fade mechanisms. First, unit selection is performed as 
indicated by the heading block 70. This entails applying 
context rules as at 72 to determine what data to extract 
from waveform database 60. The context rules, 
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depicted diagrammatical ly at 74, specify which demi- 
syllable or boundary units to extract from the database 
under certain conditions. For example, if the phoneme 
string calls for a demi-syllabl that is directly repre- 
sented in the database, then that demi-syllable is 
selected. The context rules take into account the demi- 
syllables of neighboring sound units in making selec- 
tions from the waveform database. If the required demi- 
syllable is not directly represented in the database, then 
the context rules will specify the closest approximation 
to the required demi-syllable. The context rules are 
designed to select the demi-syllables that will sound 
most natural when concatenated. Thus the context 
rules are based on linguistic principles. 
[0027] By way of illustration: If the required demi- 
syllable is preceded by a voiced bilabial stop (i.e., Ibf) in 
the synthesized word, but the demi-syllable is not found 
in such a context in the database, the context rules will 
specify the next-most desirable context. In this case, the 
rules may choose a segment preceded by a differnet 
bilabial, such as/p/. 

[0028] Next, the synthesizer builds an acoustic 
string of syllable objects corresponding to the phoneme 
string supplied as input. This step is indicated generally 
at 76 and entails constructing source data for the string 
of demi-syllables as specified during unit selection. This 
source data corresponds to the source component of 
the source-filter model. Filter parameters are also 
extracted from the database and manipulated to build 
the acoustic string. The details of filter parameter 
manipulation are discussed more fully below. The pres- 
ently preferred embodiment defines the string of sylla- 
ble objects as a linked list of syllables 78, which in turn, 
comprises a linked list of demi-syllables 80. The demi- 
syllables contain waveform snippets 82 obtained from 
waveform database 60. 

[0029] Once the source data has been compiled, a 
series of rendering steps are performed to cross fade 
the source data in the time domain and independently 
cross fade the filter parameters in the frequency 
domain. The rendering steps applied in the time domain 
appear beginning at step 84. The rendering steps 
applied in the frequency domain appear beginning at 
step 110 (Fig. 4B). 

[0030] Figure 5 illustrates the presently preferred 
technique for performing a cross fade of the source data 
in the time domain. Referring to Figure 5, a syllable of 
duration S is comprised of initial and final demi-syllables 
of duration A and B. The waveform data of demi-syllable 
A appears at 86 and the waveform data of demi-syllable 
B appears at 88. These waveform snippets are slid into 
position (arranged in time) so that both demi-syllables fit 
within syllable duration S. Note that there is some over- 
lap between demi-syllables A and B. 
[0031] The cross fade mechanism of the preferred 
embodiment performs a linear cross fade in the time 
domain. This mechanism is illustrated diagrammatically 
at 90, with the linear cross fade function being repre- 



sented at 92. Note that at time = to demi-syllable A 
receives full emphasis whil demi-syllable B receives 
zero emphasis. At time proceeds to \* d mi-syllable A is 
gradually reduced in emphasis while demi-syllable B is 
5 gradually increased in emphasis. This r suits in a com- 
posite or cross faded waveform for the entire syllable S 
as illustrated at 94. 

[0032] Referring now to Figure 4B, a separate cross 
fade process is performed on the filter parameter data 

to associated with the extracted demi-syllables. The pro- 
cedure begins by applying filter selection rules 98 to 
obtain filter parameter data from database 60. If the 
requested syllable is directly represented in a syllable 
exception component of database 60, then filter data 

is corresponding to that syllable is used as at step 100. 
Alternatively, if the filter data is not directly represented 
as a full syllable in the database, then new filter data are 
generated as at step 1 02 by applying a cross fade oper- 
ation upon data from two demi-syllables in the fre- 

20 quency domain. The cross fade operation entails 
selecting a cross fade region across which the filter 
parameters of successive demi-syllables will be cross 
faded and by then applying a suitable cross fade func- 
tion as at 106. The cross fade function is applied in the 

25 filter domain and may be a linear function (similar to that 
illustrated in Figure 5), a sigmoidal function or some 
other suitable function. Whether derived from the sylla- 
ble exception component of the database directly (as at 
set 100) or generated by the cross fade operation, the 

30 filter parameter data are stored at 108 for later use in 
the source-filter model synthesizer. 
[0033] Selecting the appropriate cross fade region 
and the cross fade function is data dependent. The 
objective of performing cross fade in the frequency 

35 domain is to eliminate unwanted glitches or resonances 
without degrading important dipthongs. For this to be 
obtained cross-fade regions must be identified in which 
the trajectories of the speech units to be joined are as 
similar as possible. For example, in the construction of 

40 the word "house", disyllabic filter units for /haw/- and - 
/aws/ could be concatenated with overlap in the nuclear 
/a/ region. 

[0034] Once the source data and filter data have 
been compiled and rendered according to the preceding 
45 steps, they are output as at 1 1 0 to the respective source 
waveform databank 112 and filter parameters databank 
1 1 4 for use by the source filter model synthesizer 1 1 6 to 
output synthesized speech. 

so Source Signal and Filter Parameter Extraction 

[0035] Figure 6 illustrates a system according to the 
invention by which the source waveform may be 
extracted from a complex input signal. Afilter/inverse-fil- 
55 ter pair are used in the extraction process. 

[0036] In Figure 6, filter 110 is defined by its filter 
model 112 and fitter parameters 114. The present 
invention also employs an inverse fitter 116 that corre- 
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sponds to the inverse of filter 110. Filter 116 would, for 
example, have the same filter parameters as filter 110, 
but would substitute zeros at each location where filter 
110 has pol s. Thus the filter 110 and inverse filter 1 1 6 
define a reciprocal system in which the effect of inverse 
filter 116 is negated or reversed by the effect of filter 
110. Thus, as illustrated, a speech waveform input to 
inverse filter 16 and subsequently processed by filter 
110 results in an output waveform that, in theory, is 
identical to the input waveform. In practice, slight varia- 
tions in filter tolerance or slight differences between fil- 
ters 116 and 110 would result in an output waveform 
that deviates somewhat from the identical match of the 
input waveform. 

[0037] When a speech waveform (or other complex 
waveform) is processed through inverse filter 116, the 
output residual signal at node 120 is processed by 
employing a cost function 122. Generally speaking, this 
cost function analyzes the residual signal according to 
one or more of a plurality of processing functions 
described more fully below, to produce a cost parame- 
ter. The cost parameter is then used in subsequent 
processing steps to adjust filter parameters 114 in an 
effort to minimize the cost parameter. In Figure 1 the 
cost minimizer block 124 diagrammatically represents 
the process by which filter parameters are selectively 
adjusted to produce a resulting reduction in the cost 
parameter. This may be performed iteratively, using an 
algorithm that incrementally adjusts filter parameters 
while seeking the minimum cost. 
[0038] Once the minimum cost is achieved, the 
resulting residual signal at node 120 may then be used 
to represent an extracted source signal for subsequent 
source-filter model synthesis. The filter parameters 114 
that produced the minimum cost are then used as the fil- 
ter parameters to define filter 1 1 0 for use in subsequent 
source-filter model synthesis. 
[0039] Figure 7 illustrates the process by which the 
source signal is extracted, and the filter parameters 
identified, to achieve a source-filter model synthesis 
system in accordance with the invention. 
[0040] First a filter model is defined at step 1 50. Any 
suitable filter model that lends itself to a parameterized 
representation may be used. An initial set of parameters 
is then supplied at step 152. Note that the initial set of 
parameters will be iteratively altered in subsequent 
processing steps to seek the parameters that corre- 
spond to a minimized cost function. Different techniques 
may be used to avoid a sub-optimal solution corre- 
sponding to a local minima. For example, the initial set 
of parameters used at step 152 can be selected from a 
set or matrix of parameters designed to supply several 
different starting points in order to avoid the local 
minima. Thus in Figure 7 note that step 1 52 may be per- 
formed multiple times for different initial sets of parame- 
ters. 

[0041 ] The filter model defined at 1 50 and the initial 
set of parameters defined at 152 are then used at step 



154 to construct a filter (as at 156) and an inverse filter 
(as at 158). 

[0042] Next, the speech signal is applied to the 
inverse filter at 1 60 to extract a residual signal as at 1 64. 

5 As illustrated, the preferred embodiment uses a Han- 
ning window centered on the current pitch epoch and 
adjusted so that it covers two-pitch periods. Other win- 
dows are also possible. The residual signal is then proc- 
essed at 166 to extract data points for use in the arc- 

10 length calculation. 

[0043] The residual signal may be processed in a 
number of different ways to extract the data points. As 
illustrated at 168, the procedure may branch to one or 
more of a selected class of processing routines. Exam- 

75 pies of such routines are illustrated at 1 70. Next the arc- 
length (or square-length) calculation is performed at 
172. The resultant value serves as a cost parameter. 
[0044] After calculating the cost parameter for the 
initial set of filter parameters, the filter parameters are 

20 selectively adjusted at step 174 and the procedure is 
iteratively repeated as depicted at 176 until a minimum 
cost is achieved. 

[0045] Once the minimum cost is achieved, the 
extracted residual signal corresponding to that minimum 
25 cost is used at step 178 as the source signal. The filter 
parameters associated with the minimum cost are used 
as the fitter parameters (step 180) in a source-filter 
model. 

[0046] For further details regarding source signal 
30 and filter parameter extraction, refer to co-pending U.S. 
patent application, "Method and Apparatus to Extract 
Formant-Based Source-Filter Data for Coding and Syn- 
thesis Employing Cost Function and Inverse Filtering," 

Serial Number , filed 

35 by Steve Pearson and assigned to the 

assignee of the present invention. 
[0047] While the invention has been described in its 
presently preferred embodiment, it will be understood 
that the invention is capable of certain modification with- 
40 out departing from the spirit of the invention as set forth 
in the appended claims. 

Claims 

45 1. A concatenate speech synthesizer, comprising: 

a database containing (a) demi-syllable wave- 
form data associated with a plurality of demi- 
syllables and (b) filter parameter data associ- 
50 ated with said plurality of demi-syllables; 

a unit selection system for extracting selected 
demi-syllable waveform data and filter parame- 
ters from said database that correspond to an 
input string to be synthesized; 
55 a waveform cross fade mechanism for joining 

pairs of xtracted demi-syllable waveform data 
into syllable waveform signals; 
a filter parameter cross fade mechanism for 
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defining a set of syllable-level tilt r data by 
interpolating said extracted filter parameters; 
and 

a filter module receptive of said set of syllable- 
level filter data and operative to process said s 
syllable waveform signals to generate synthe- 
sized speech. 

2. The synthesizer of claim 1 wherein said waveform 
cross fade mechanism operates in the time domain, w 

3. The synthesizer of claim 1 wherein said filter 
parameter cross fade mechanism operates in the 
frequency domain. 

15 

4. The synthesizer of claim 1 wherein said waveform 
cross fade mechanism performs a linear cross fade 
upon two demi-syllables over a predefined duration 
corresponding to a syllable. 

20 

5. The synthesizer of claim 1 wherein said filter 
parameter cross fade mechanism interpolates 
between the respective extracted filter parameters 
of two demi-syllables. 

25 

6. The synthesizer of claim 1 wherein said filter 
parameter cross fade mechanism performs linear 
interpolation between the respective extracted filter 
parameters of two demi-syllables. 

30 

7. The synthesizer of claim 1 wherein said filter 
parameter cross fade mechanism performs sigmoi- 
dal interpolation between the respective extracted 
filter parameters of two demi-syllables. 
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